Reviews: Recovering Bandits

Neural Information Processing Systems

The major point that needs to be clarified for me is the distinction between single-play regret and multiple-play regret. First, the UCB-Z algorithm, for example, is not of this type, and it is a bit odd to have a regret notion that only applies to very specific algorithms. Second, regarding the distinction between single- and multiple-play regret, am I right that E[R_T^{(d,m)}] ≥ E[R_T^{(d)}]? That is, one would in principle care about the multiple-play lookahead regret, since one should be allowed to play an arm multiple times during the d time steps over which we optimize. From my understanding, the single-play notion would be defined only for strategies that select each arm at most once during the d steps (and therefore d cannot be larger than K in this case, right?).


Dynamic Planning and Learning under Recovering Rewards

Simchi-Levi, David, Zheng, Zeyu, Zhu, Feng

arXiv.org Machine Learning

Motivated by emerging applications such as live-streaming e-commerce, promotions and recommendations, we introduce a general class of multi-armed bandit problems that have the following two features: (i) the decision maker can pull and collect rewards from at most $K$ out of $N$ different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled, and then non-parametrically recovers as the idle time increases. With the objective of maximizing expected cumulative rewards over $T$ time periods, we propose, construct and prove performance guarantees for a class of "Purely Periodic Policies". For the offline problem when all model parameters are known, our proposed policy obtains an approximation ratio of the order of $1-\mathcal O(1/\sqrt{K})$, which is asymptotically optimal as $K$ grows to infinity. For the online problem when the model parameters are unknown and need to be learned, we design an Upper Confidence Bound (UCB) based policy that approximately has $\widetilde{\mathcal O}(N\sqrt{T})$ regret against the offline benchmark. Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications with non-stationary and recovering rewards.
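The "pull each arm on a fixed period" idea behind purely periodic policies can be sketched with a toy single-arm calculation (an illustrative reduction, not the paper's actual policy construction; the recovery function and candidate periods below are made up): pulling an arm once every p periods earns its recovered reward f(p) per pull, i.e. an average of f(p)/p per period, so a periodic schedule should use the period maximizing that ratio.

```python
import math

def best_period(recovery, periods):
    """Pick the pulling period p maximizing the long-run reward rate f(p)/p.

    `recovery(z)` is the expected reward of an arm that has been idle
    for z periods; pulling it every p periods earns recovery(p) per pull,
    i.e. recovery(p) / p per period on average.
    """
    return max(periods, key=lambda p: recovery(p) / p)

# Hypothetical recovery function: reward drops after a pull and recovers
# slowly at first, so pulling every single period is suboptimal.
f = lambda z: 1 - math.exp(-0.3 * z ** 2)

print(best_period(f, range(1, 7)))  # → 2
```

With this particular curve, waiting two periods between pulls (rate ≈ 0.35 per period) beats pulling every period (rate ≈ 0.26), which is the basic trade-off a periodic schedule exploits.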


Recovering Bandits

Pike-Burke, Ciara, Grünewälder, Steffen

arXiv.org Machine Learning

We study the recovering bandits problem, a variant of the stochastic multi-armed bandit problem where the expected reward of each arm varies according to some unknown function of the time since the arm was last played. While being a natural extension of the classical bandit problem that arises in many real-world settings, this variation is accompanied by significant difficulties. In particular, methods need to plan ahead and estimate many more quantities than in the classical bandit setting. In this work, we explore the use of Gaussian processes to tackle the estimation and planning problem. We also discuss different regret definitions that let us quantify the performance of the methods. To improve the computational efficiency of the methods, we provide an optimistic planning approximation. We complement these discussions with regret bounds and empirical studies.
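The setting described above — each arm's expected reward depending on the time since it was last played — can be simulated with a minimal sketch (the exponential recovery curves and the myopic greedy baseline below are illustrative assumptions, not the paper's GP-based method):

```python
import numpy as np

def simulate(recovery_fns, policy, T=200, seed=0):
    """Run one recovering-bandit episode and return the total reward.

    recovery_fns[k](z) gives arm k's expected reward after z idle steps;
    `policy(z)` maps the vector of idle times to the index of the arm to pull.
    """
    rng = np.random.default_rng(seed)
    z = np.ones(len(recovery_fns), dtype=int)   # idle time of each arm
    total = 0.0
    for _ in range(T):
        arm = policy(z)
        total += recovery_fns[arm](z[arm]) + rng.normal(0.0, 0.1)
        z += 1                                   # every arm recovers one step...
        z[arm] = 1                               # ...except the one just pulled
    return total

# Hypothetical recovery curves: reward drops after a pull, then recovers.
fns = [lambda z: 0.5 * (1 - np.exp(-0.5 * z)),
       lambda z: 1.0 * (1 - np.exp(-0.2 * z))]

greedy = lambda z: int(np.argmax([f(zk) for f, zk in zip(fns, z)]))
always0 = lambda z: 0

# Even a myopic policy that tracks recovery beats ignoring it entirely.
print(simulate(fns, greedy) > simulate(fns, always0))  # → True
```

The greedy policy here is one-step lookahead only; the planning difficulty the abstract refers to is exactly that the best action can depend on the rewards of whole pull sequences, not just the current idle times.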